Insurance data was analyzed to study several variables related to insurance costs. The data suggest that beneficiaries are located throughout the United States, but BMI differs by region. It also appears that more beneficiaries have no children than any other number of children. Cost also appears to increase with age and to be higher overall for smokers than for non-smokers.
Description
Rows: 1,338
Columns: 7
$ age <dbl> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 1…
$ sex <chr> "female", "male", "male", "male", "male", "female", "female",…
$ bmi <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74…
$ children <dbl> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0…
$ smoker <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
$ region <chr> "southwest", "southeast", "southeast", "northwest", "northwes…
$ charges <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,…
This page offers a brief analysis of general distribution graphs of single variables.
Region
This bar chart shows the number of insurance beneficiaries in each region of the US. The southeast region appears to have the most beneficiaries, but the counts are similar across all regions, so the difference is not very large. (corresponds with Assignment 6, question 3)
BMI
This histogram displays the distribution of BMI (Body Mass Index). It appears to be unimodal and roughly symmetric, which suggests an approximately bell-shaped distribution. (corresponds with Assignment 6, question 5)
Charges
This histogram displays the distribution of charges, which appears to be right-skewed: most clients pay lower charges while a few pay very high charges. Because the mean is pulled toward those high values, the median should be used as the center of the data. (corresponds with Assignment 6, question 6)
Children
This pie chart represents the number of children per beneficiary. Clients with no children make up the largest share, and as the number of children increases, the percentage of beneficiaries with that number of children decreases, with 5 children making up the smallest percentage of households. (corresponds with Assignment 6, question 14)
This page analyzes the relationship among the variables smoker, age, and charges.
General:
The graphs on this page point to the large role that smoking status plays in charges when modeled with the age variable. This matters because any model of charges based on other variables in the data should take smoking status into account (question 13).
Distribution of smokers by region:
The stacked bar plot displays that there are fewer smokers than non-smokers in all regions. There is not a large difference in the distribution of smokers among the four regions.
Age and charges:
This scatterplot displays the relationship between age and charges. It appears that as age increases, charges also increase so the relationship is positive and linear.
Age, smoke status, and charges:
This scatterplot shows that smoking status plays a role in charges. Beneficiaries who smoke pay higher costs than those who do not. There also appears to be more variability in charges among smokers than among non-smokers. In both groups, charges increase roughly linearly with age.
Age, smoker and charges:
There appear to be two distinct groups in this plot of age and charges among smokers, with a large gap between them. There does appear to be a positive linear relationship between age and charges among smokers. I am not sure a smooth line makes much sense here: it sits in the blank space between the two groups, so it is not the best way to summarize the relationship between age and charges among smokers. (question 11)
Age, non-smoker, and charges:
This plot displays the relationship between age and charges in the nonsmoker group. Here the smooth line (unlike in the smoker group) summarizes the data well, because the points lie close together with some outliers rather than in two distinct groups as in the previous graph. (question 12)
This page contains graphs relating to BMI.
The first histogram represents the distribution of beneficiary BMI and appears to be mostly symmetrical and unimodal.
The second graph, a box plot, represents the distribution of BMI by region. It suggests that beneficiaries with the highest BMI reside in the southeast region and those with the lowest BMI reside in the northwest region. The northeast, northwest, and southwest regions have similar distributions and shapes, while the southeast region differs from the others.
This page contains the graphs relating to the distribution of children and the relationship between children and charges.
The first chart, a pie chart, shows the number of children per beneficiary. Clients with no children make up the largest share, and as the number of children increases, the percentage of beneficiaries with that number of children decreases, with 5 children making up the smallest percentage of households.
The second chart, a box plot, suggests that those with zero children pay the most while those with five pay the least. There are also more outliers in the zero-children group than in any other. The IQR is largest for families with 2 children, but it is similar to that of families with 3 children. Previous graphs showed that zero children is the most common count, which may account for the large difference in charges between that group and the five-children group, which made up the smallest percentage of clients.
---
title: "Assignment 7"
output:
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: default
      navbar-bg: "pink"
    orientation: columns
    vertical_layout: fill
    source_code: embed
---
```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
library(DT)
library(plotly)
insurance<-read_csv("./insurance.csv")
```
Overview
===
Insurance data was analyzed to study several variables related to insurance costs. The data suggest that beneficiaries are located throughout the United States, but BMI differs by region. It also appears that more beneficiaries have no children than any other number of children. Cost also appears to increase with age and to be higher overall for smokers than for non-smokers.
Data
===
Column {data-width=550}
---
### <b><font size = 4><span Style = "color:pink">First 500 Observations</span></font></b>
```{r show_table}
datatable(insurance[1:500,], rownames = FALSE, colnames = c("Age", "Sex", "BMI", "Children", "Smoker", "Region", "Charges"), options=list( pageLength=20))
```
Column {data-width=450}
---
### <font size = 4><span Style = "color:pink">Description</span></font>
Description
- age: age of primary beneficiary
- sex: insurance contractor gender, female, male
- bmi: Body mass index, an objective index of body weight relative to height (kg/m²),
indicating weights that are relatively high or low for a given height; ideally 18.5 to
24.9
- children: Number of children covered by health insurance / Number of dependents
- smoker: Smoking status
- region: the beneficiary's residential area in the US, northeast, southeast, southwest,
northwest.
- charges: Individual medical costs billed by health insurance
```{r}
glimpse(insurance)
```
General Distribution
===
This page offers a brief analysis of general distribution graphs of single variables.
Column {.tabset data-width=550}
---
### Region
```{r bar1}
ggplot(insurance,aes(x=region))+geom_bar(fill="#3295a8")+labs(title="Number of Beneficiaries by Region")+theme(text=element_text(size=20))
```
### BMI
```{r histogram1}
# bins = 30 matches the geom_histogram() default and makes the choice explicit
ggplot(insurance,aes(x=bmi))+geom_histogram(bins=30,fill="#4535bd")+labs(title = "Distribution of Beneficiaries by BMI")+theme(text=element_text(size = 17))
```
### Charges
```{r histogram2}
# bins = 30 matches the geom_histogram() default and makes the choice explicit
ggplot(insurance,aes(x=charges))+geom_histogram(bins=30,fill="#db70a0")+labs(title = "Distribution of Beneficiaries by Charges")+theme(text=element_text(size = 17))
```
### Children
```{r pie1}
children_pie <- count(insurance,children)
children_pie$percent <- round(children_pie$n/sum(children_pie$n)*100,2)
children_pie$children <- as.character(children_pie$children)
pie <- ggplot(children_pie, aes(x="", y= percent, fill = children))+geom_bar(stat = "identity", width = 1, color= "black")
pie <- pie + coord_polar("y", start = 0) + geom_text(aes(label=paste0(percent, "%")), fontface= "bold", color="black", position = position_stack(vjust = 0.5))
pie <- pie + scale_fill_brewer(palette = 3)
pie <- pie+theme_void()+theme(text=element_text(size=20))+labs(title = "Number of Children per Household")
pie
```
Column {data-width=450}
---
### Analysis
**Region**
This bar chart shows the number of insurance beneficiaries in each region of the US. The southeast region appears to have the most beneficiaries, but the counts are similar across all regions, so the difference is not very large. (corresponds with Assignment 6, question 3)
**BMI**
This histogram displays the distribution of BMI (Body Mass Index). It appears to be unimodal and roughly symmetric, which suggests an approximately bell-shaped distribution. (corresponds with Assignment 6, question 5)
**Charges**
This histogram displays the distribution of charges, which appears to be right-skewed: most clients pay lower charges while a few pay very high charges. Because the mean is pulled toward those high values, the median should be used as the center of the data. (corresponds with Assignment 6, question 6)
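The effect of right skew on the two measures of center can be checked numerically. The sketch below is illustrative only: it uses just the first six `charges` values visible in the `glimpse()` output rather than the full data, and `center_summary` is a hypothetical helper name that is not part of the dashboard.

```r
# First six charges visible in the glimpse() output (a small sample for
# illustration only, not the full 1,338 rows).
charges_sample <- c(16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622)

# Hypothetical helper: return the mean, the median, and the gap between them.
# A positive gap means the mean sits above the median, as expected under right skew.
center_summary <- function(x) {
  c(mean = mean(x), median = median(x), gap = mean(x) - median(x))
}

center_summary(charges_sample)
```

On the full data the same check would be `center_summary(insurance$charges)`.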
**Children**
This pie chart represents the number of children per beneficiary. Clients with no children make up the largest share, and as the number of children increases, the percentage of beneficiaries with that number of children decreases, with 5 children making up the smallest percentage of households. (corresponds with Assignment 6, question 14)
Smoking Status and Age
===
This page analyzes the relationship among the variables smoker, age, and charges.
Column {.tabset data-width=550}
---
### Distribution of smokers by region
```{r bar2}
ggplot(insurance,aes(x=region,fill=smoker))+geom_bar(position = "fill")+scale_y_continuous(breaks = seq(0,1,by=0.2),labels = scales::percent)+labs(y="Percent", title = "Percentage of Beneficiary Smoking Status by Region")+scale_fill_brewer(palette = "Set1")
```
### Age and charges
```{r scatter4}
ggplot(insurance,aes(x=age,y=charges))+geom_point(alpha=0.75,size=2,color="#2e785f")+labs(title="Relationship Between Age and Charges")+theme(text=element_text(size=16))
```
### Age, smoke status and charges
```{r scatter1}
ggplot(insurance,aes(x=age,y=charges,color=smoker))+geom_point(alpha=0.75)+labs(title = "Relationship Between Smoking Status, Age, and Charges")+scale_color_brewer(palette="Set1")
smoker<- filter(insurance, smoker == "yes")
nonsmoker <- filter(insurance, smoker == "no")
```
### Age, smoker and charges
```{r scatter2}
ggplot(smoker,aes(x=age,y=charges))+geom_point(color="#6042f5")+geom_smooth(se=FALSE, color= "black")+ labs(title="Relationship Between Age and Charges Among Smokers")
```
### Age, non-smoker and charges
```{r scatter3}
ggplot(nonsmoker,aes(x=age,y=charges))+geom_point(color="#6042f5")+geom_smooth(se=FALSE, color= "black")+ labs(title="Relationship Between Age and Charges Among Nonsmokers")
```
Column {data-width=450}
---
### Analysis
**General:**
The graphs on this page point to the large role that smoking status plays in charges when modeled with the age variable. This matters because any model of charges based on other variables in the data should take smoking status into account (question 13).
**Distribution of smokers by region:**
The stacked bar plot displays that there are fewer smokers than non-smokers in all regions. There is not a large difference in the distribution of smokers among the four regions.
**Age and charges:**
This scatterplot displays the relationship between age and charges. It appears that as age increases, charges also increase so the relationship is positive and linear.
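One way to back up a "positive and linear" reading is to look at the correlation and the slope of a fitted line. This is a minimal sketch assuming base R only; the six rows are the first `age` and `charges` values visible in the `glimpse()` output, used purely for illustration, and `age_charges_trend` is a hypothetical helper name.

```r
# First six rows visible in the glimpse() output (illustration only).
sample_rows <- data.frame(
  age     = c(19, 18, 28, 33, 32, 31),
  charges = c(16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622)
)

# Hypothetical helper: Pearson correlation plus the slope of charges ~ age.
# The two always share a sign; a positive sign means charges rise with age.
age_charges_trend <- function(df) {
  fit <- lm(charges ~ age, data = df)
  c(correlation = cor(df$age, df$charges), slope = unname(coef(fit)["age"]))
}

age_charges_trend(sample_rows)
```

With so few rows the numbers themselves are not meaningful; `age_charges_trend(insurance)` would apply the same check to the full data.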
**Age, smoke status, and charges:**
This scatterplot shows that smoking status plays a role in charges. Beneficiaries who smoke pay higher costs than those who do not. There also appears to be more variability in charges among smokers than among non-smokers. In both groups, charges increase roughly linearly with age.
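To see how much smoking status shifts charges at a given age, one option is a linear model with both predictors. This is a sketch under assumptions, not the analysis behind the plots: it fits a simple additive model with no interaction term, uses only the first six rows visible in the `glimpse()` output, and `fit_age_smoker` is a hypothetical name.

```r
# First six rows visible in the glimpse() output (illustration only).
sample_rows <- data.frame(
  age     = c(19, 18, 28, 33, 32, 31),
  smoker  = c("yes", "no", "no", "no", "no", "no"),
  charges = c(16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622)
)

# Hypothetical helper: additive model of charges on age and smoking status.
# The "smokeryes" coefficient is the estimated shift in charges for smokers
# relative to non-smokers of the same age.
fit_age_smoker <- function(df) {
  lm(charges ~ age + smoker, data = df)
}

coef(fit_age_smoker(sample_rows))
```

Calling `summary(fit_age_smoker(insurance))` on the full data would show whether the smoker term remains large once age is accounted for.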
**Age, smoker and charges:**
There appear to be two distinct groups in this plot of age and charges among smokers, with a large gap between them. There does appear to be a positive linear relationship between age and charges among smokers. I am not sure a smooth line makes much sense here: it sits in the blank space between the two groups, so it is not the best way to summarize the relationship between age and charges among smokers. (question 11)
**Age, non-smoker, and charges:**
This plot displays the relationship between age and charges in the nonsmoker group. Here the smooth line (unlike in the smoker group) summarizes the data well, because the points lie close together with some outliers rather than in two distinct groups as in the previous graph. (question 12)
BMI
===
This page contains graphs relating to BMI.
Column {data-width=550}
---
### Distribution of BMI
```{r}
# bins = 30 matches the geom_histogram() default and makes the choice explicit
ggplot(insurance,aes(x=bmi))+geom_histogram(bins=30,fill="#4535bd")+labs(title = "Distribution of Beneficiaries by BMI")+theme(text=element_text(size = 17))
```
### Distribution of BMI by Region
```{r}
ggplot(insurance,aes(x=region, y=bmi))+geom_boxplot(fill="#dadee3")+theme(text=element_text(size=16),panel.grid.major = element_blank())+labs(title = "Distribution of BMI based on Region")
```
Column {data-width=450}
---
### Analysis
The first histogram represents the distribution of beneficiary BMI and appears to be mostly symmetrical and unimodal.
The second graph, a box plot, represents the distribution of BMI by region. It suggests that beneficiaries with the highest BMI reside in the southeast region and those with the lowest BMI reside in the northwest region. The northeast, northwest, and southwest regions have similar distributions and shapes, while the southeast region differs from the others.
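A numeric companion to the box plot is the median BMI in each region. The sketch below assumes base R only and uses the four rows whose `bmi` and `region` values are fully visible in the `glimpse()` output, so it illustrates the calculation rather than the result; `median_by_group` is a hypothetical helper name.

```r
# First four rows visible in the glimpse() output (illustration only).
sample_rows <- data.frame(
  bmi    = c(27.900, 33.770, 33.000, 22.705),
  region = c("southwest", "southeast", "southeast", "northwest")
)

# Hypothetical helper: median of `values` within each group, highest first.
median_by_group <- function(values, groups) {
  sort(tapply(values, groups, median), decreasing = TRUE)
}

median_by_group(sample_rows$bmi, sample_rows$region)
```

For the full data, `median_by_group(insurance$bmi, insurance$region)` would give one median per region to compare against the box plot.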
Children
===
This page contains the graphs relating to the distribution of children and the relationship between children and charges.
Column {data-width=550}
---
### Distribution of Children
```{r}
children_pie <- count(insurance,children)
children_pie$percent <- round(children_pie$n/sum(children_pie$n)*100,2)
children_pie$children <- as.character(children_pie$children)
pie <- ggplot(children_pie, aes(x="", y= percent, fill = children))+geom_bar(stat = "identity", width = 1, color= "black")
pie <- pie + coord_polar("y", start = 0) + geom_text(aes(label=paste0(percent, "%")), fontface= "bold", color="black", position = position_stack(vjust = 0.5))
pie <- pie + scale_fill_brewer(palette = 3)
pie <- pie+theme_void()+theme(text=element_text(size=20))+labs(title = "Number of Children per Household")
pie
```
### Relationship between Children and Charges
```{r}
insurance$children <- as.character(insurance$children)
ggplot(insurance, aes(x= children, y=charges)) + geom_boxplot(fill="lightgray", color= "black")+labs(title= "Distribution of Charges based on Number of Children")
```
Column {data-width=350}
---
### Analysis
The first chart, a pie chart, shows the number of children per beneficiary. Clients with no children make up the largest share, and as the number of children increases, the percentage of beneficiaries with that number of children decreases, with 5 children making up the smallest percentage of households.
The second chart, a box plot, suggests that those with zero children pay the most while those with five pay the least. There are also more outliers in the zero-children group than in any other. The IQR is largest for families with 2 children, but it is similar to that of families with 3 children. Previous graphs showed that zero children is the most common count, which may account for the large difference in charges between that group and the five-children group, which made up the smallest percentage of clients.
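Because the groups differ so much in size, it can help to put the group size next to the median and IQR of charges. This is a sketch assuming base R only; the six rows are the first visible in the `glimpse()` output and serve only to show the calculation, and `charges_by_children` is a hypothetical helper name.

```r
# First six rows visible in the glimpse() output (illustration only).
sample_rows <- data.frame(
  children = c(0, 1, 3, 0, 0, 0),
  charges  = c(16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622)
)

# Hypothetical helper: one row per number of children with the group size,
# the median charge, and the IQR of charges in that group.
charges_by_children <- function(df) {
  groups <- factor(df$children)
  data.frame(
    children       = levels(groups),
    n              = as.vector(table(groups)),
    median_charges = as.vector(tapply(df$charges, groups, median)),
    iqr_charges    = as.vector(tapply(df$charges, groups, IQR))
  )
}

charges_by_children(sample_rows)
```

Running `charges_by_children(insurance)` on the full data would show how few beneficiaries sit in the larger-family groups, which is worth keeping in mind when comparing their boxes.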